Abstract: In multi-label classification, the main focus has been
to develop ways of learning the underlying dependencies between
labels, and to take advantage of this at classification time.
Developing better feature-space representations has been
predominantly employed to reduce complexity, e.g., by eliminating
non-helpful feature attributes from the input space prior to (or
during) training. This is an important task, since many
multi-label methods typically create many different copies or views
of the same input data as they transform it, and considerable
memory can be saved by taking advantage of redundancy. In this
paper, we show that a proper development of the feature space
can make labels less interdependent and easier to model and
predict at inference time. For this task we use a deep learning
approach with restricted Boltzmann machines. We present a deep
network that, in an empirical evaluation, outperforms a number
of competitive methods from the literature.

I. Introduction 

Multi-label classification is the supervised learning problem 
where an instance may be associated with multiple labels. This 
is opposed to the traditional task of single-label classification 
(i.e., multi-class, or binary) where each instance is only 
associated with a single class label. The multi-label context 
is receiving increased attention and is applicable to a wide 
variety of domains, including text, audio data, still images and 
video, and bioinformatics, CZl, EZl, 1231 and the references 
therein. 

The most well-known approach to multi-label classification 
is to simply train an independent classifier for each label. 
This is usually known in the literature as the binary relevance 
(BR) transformation, e.g., ED, E). Essentially, a multi-label 
problem is transformed into one binary problem for each 
label and any off-the-shelf binary classifier is applied to each 
of these problems individually. Practically all the multi-label 
literature identifies that this method is limited by the fact that 
dependencies between labels are not explicitly modelled and 
proposes algorithms to take these dependencies into account. 
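
As a rough illustration of the BR transformation described above, the following Python sketch splits a multi-label training set into one binary problem per label. The function name `binary_relevance_split` and the toy data are assumptions made for illustration only; any off-the-shelf binary classifier could then be trained on each returned problem.

```python
def binary_relevance_split(X, Y):
    """Return one (X, y_j) binary problem per label column of Y."""
    num_labels = len(Y[0])
    return [(X, [row[j] for row in Y]) for j in range(num_labels)]


# Toy data (hypothetical): 3 instances, 2 features, 3 labels.
X = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
Y = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]

problems = binary_relevance_split(X, Y)
for j, (_, y_j) in enumerate(problems):
    print(f"label {j}: targets {y_j}")
```

Note that every binary problem shares the same feature matrix, which is why label dependencies are invisible to the individual classifiers.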

To date, many successful multi-label algorithms have been 
obtained by the so-called problem transformation methods 
(where the multi-label problem is transformed into several 
multi-class or binary problems), for example, na, 0, m, 
ED, la. These methods make many copies of the feature 
space in memory (or make many passes over it). Most of the 
highest performing methods also use ensembles, for example 
with support vector machines (SVMs) na, El, decision trees 
ifTSl . probabilistic methods ll^ . Il 28 l or boosting ifTTll . ESll . 

That is to say, most competitive methods in the literature
could benefit tremendously from more concise representations
of the feature space, much more so than in the single-label
context: the initial investment in reducing the number of
feature variables in a multi-label problem is much more likely
to offer considerable speed-ups during learning and
classification. However, relatively little work in the
multi-label literature has considered this approach.

Using the raw instance data to construct a model makes 
the implicit assumption that the labels originate from this 
data and that they can be recovered directly from it. Usually, 
however, both the labels and the feature variables originate 
from particular abstract concepts. For example, we generally 
think of an image as being labelled beach, not because its 
pixel-data vector is beach-like, but rather because the image 
itself meets some criteria of our abstract idea of what a beach 
is. Ideally then, a feature set would include (for example) 
variables for a grainy surface such as sand or pebbles, and 
for being adjacent to a (significant) body of water. Hence, 
it is highly desirable to recover the hidden dependencies and 
structure from the original concepts behind the learning task. A
good representation of these dependencies makes the problem
easier to learn.

A Restricted Boltzmann Machine (RBM) a learns a layer 
of hidden features in an unsupervised fashion. This hidden 
layer can capture complex dependencies and structure from 
the input space, and represent it more compactly (whenever 
the number of hidden units is smaller than the number of 
original feature attributes). The methods we detail in this
paper using RBMs offer some interesting benefits to
multi-label classification in a variety of domains:

• The predictive performance of existing state-of-the-art 
methods is generally improved. 

• Many classification paradigms previously relatively
uncompetitive in multi-label learning can often obtain much
higher predictive performance and become competitive,
and thus now offer their respective advantages to this
context, such as better posterior-probability estimates,
lower memory consumption, faster performance, easier
implementation, and incremental learning.

• The output feature space can be updated incrementally.
This not only makes incremental learning feasible, but
also means that cost savings are magnified for batch
learners that need to be retrained at intervals on new data.

• The model can be built using unlabelled examples, which
are typically obtained much more cheaply than labelled
examples, especially in multi-label contexts, since
examples are assigned multiple labels.

We also stack several RBMs to create Deep Belief Networks
(DBNs), and look at two approaches to using them. In the
first approach, we learn the final layer together
with the labels and use an existing multi-label classifier. In
the second approach, we use back-propagation to fine-tune the
weights of our neural network for discriminative prediction,
and augment this with a second multi-label predictive layer.

We develop a framework to experiment with RBMs and
DBNs in a variety of multi-label classification contexts. Within
this framework we carry out an empirical evaluation with
many different methods from the literature, on a collection
of real-world datasets from diverse domains (to the best of
our knowledge, this is also the largest and most varied collection
of datasets analysed with an RBM framework). The results
indicate the benefits of this style of learning for multi-label
classification.

II. Prior Work 

Multi-label datasets and classification methods have rapidly 
become more numerous in recent years, and classification 
performance has steadily improved. An overview of the most 
well known and influential work in this area is provided in 

lEa, HU. 

The binary relevance approach (BR) does not obtain high 
predictive performance because it does not model dependen¬ 
cies between labels. A number of methods have improved on 
this predictive performance with methods that do model label 
dependence. 

A well-known alternative is the label powerset (LP) method
||2^ which transforms the multi-label problem into a
single-label problem with a single class variable, having the
powerset as its set of values (i.e., all $2^L$ possible label
combinations). In
LP, label dependencies are modelled directly and predictive
performance is greater than BR, but computational complexity
is too high for most practical applications. The complexity
issue has been addressed in works such as El and lfT3l . The
former presents RAkEL (RAndom k-labEL sets), an ensemble
method that selects m subsets of k labels and uses LP to learn
each of these subproblems.
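
The LP transformation and the random label subsets used by RAkEL can be sketched in a few lines of Python. The function names, the toy label matrix and the fixed random seed below are assumptions for illustration; in particular, this sketch only creates classes for label combinations that actually occur in the data, and it does not prevent the same subset from being drawn twice.

```python
import random


def label_powerset_encode(Y):
    """Map each distinct label vector to a single class id (LP transformation)."""
    class_of = {}
    classes = []
    for row in Y:
        key = tuple(row)
        if key not in class_of:
            class_of[key] = len(class_of)
        classes.append(class_of[key])
    return classes, class_of


def random_k_labelsets(num_labels, k, m, seed=0):
    """Draw m random subsets of k label indices (RAkEL-style subproblems)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(num_labels), k)) for _ in range(m)]


# Toy label matrix (hypothetical): 4 instances, 3 labels.
Y = [[1, 0, 1], [0, 1, 1], [1, 0, 1], [0, 0, 0]]
classes, class_of = label_powerset_encode(Y)
subsets = random_k_labelsets(num_labels=3, k=2, m=4)
print(classes, subsets)
```

Each subset would then be treated as its own small LP problem, restricted to the k selected label columns.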

The classifier chain approach (CC) OH has received recent
attention, for example in 13 and ll26l . This method employs
one classifier for each label, like BR, but the classifiers are not
independent. Rather, each classifier predicts the binary
relevance of its label given the input space plus the predictions
of the previous classifiers (hence the chain).
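
A minimal sketch of the chaining idea follows. It is not the implementation evaluated in this paper: the tiny perceptron is only a stand-in base classifier, the toy data are invented, and training each link on the true values of the earlier labels (rather than on predictions) is an assumption of this sketch.

```python
def train_perceptron(X, y, epochs=20):
    """Tiny perceptron used as a stand-in base classifier (last weight = bias)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, t in zip(X, y):
            a = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            err = t - (1 if a > 0 else 0)
            w = [wi + err * xi for wi, xi in zip(w, x)] + [w[-1] + err]
    return w


def predict_perceptron(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + w[-1] > 0 else 0


def train_chain(X, Y):
    """Train one classifier per label; classifier j also sees labels 0..j-1."""
    chain = []
    for j in range(len(Y[0])):
        X_aug = [x + y[:j] for x, y in zip(X, Y)]  # features plus earlier labels
        chain.append(train_perceptron(X_aug, [y[j] for y in Y]))
    return chain


def predict_chain(chain, x):
    """Predict labels in chain order, feeding earlier predictions forward."""
    y_hat = []
    for w in chain:
        y_hat.append(predict_perceptron(w, x + y_hat))
    return y_hat


# Toy data (hypothetical): label 0 = x[0], label 1 = x[0] AND x[1].
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0, 0], [0, 0], [1, 0], [1, 1]]
chain = train_chain(X, Y)
print([predict_chain(chain, x) for x in X])
```

The second classifier receives the first label as an extra input feature, which is how the chain passes label information along.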

Another type of binary-classification approach is the pairwise
transformation method (PW), where a binary model is
trained for each pair of labels. The predictions result more
naturally in a set of pairwise preferences than a multi-label
prediction (thus becoming popular in ranking schemes), but
PW methods can be adapted to make multi-label predictions,
for example Q. These methods perform well in several
domains, although their application can easily be prohibitive
on many datasets due to their quadratic complexity in the
number of labels.

An alternative to problem transformation is algorithm
adaptation, where a specific single-label method is adapted directly
for multi-label classification. MLkNN ISl is a k-nearest
neighbours method adapted for multi-label learning by voting from
the labels found in the neighbours. IBLR is a related method
that also incorporates a second layer of logistic regression.
BPMLL ll^ is a back-propagation neural network adapted for
multi-label classification by having multiple binary outputs as
the label variables.

Processing the feature space of multi-label data has already
been studied in the literature. Ga presents an overview of
the main techniques with respect to problem transformation
methods. In lIZTl a clustering-based supervised approach is
used to obtain label-specific features for each label. The
advantages of this method are reduced where label relevances
are not trained separately, for example in LP methods (which
learn all labels together as a single multi-class meta-label).
In any case, this is a meta-technique that can easily be applied
independently of other preprocessing and learning techniques,
such as the one we describe in this paper.

In redundancy is eliminated from the learning space 
of the BR method by taking random subsets of the training 
space across an ensemble. This work centers on the fact 
that a standard BR approach considers the full input space 
for each label, even though only a subset of the variables 
may be relevant to any particular label. Compressive sensing
techniques have also been used in the literature for reducing
the complexity of multi-label data by taking advantage of label
sparsity E], ini.

These methods are mainly motivated by reducing an
algorithm's running time by reducing the number of feature
variables in the input space, rather than learning or modelling
the dependencies between them. More examples of
feature-space reduction for multi-label classification are reviewed in
Ea.

The authors of m use a fully-connected network closely 
related to a Boltzmann machine for multi-label classification, 
using Gibbs sampling for inference. They use this network 
to model dependencies in the label space for prediction, 
rather than to improve the feature space. Since this is a fully 
connected network, it is tractable only for problems with a 
relatively small number of labels. 

Figure 1 roughly illustrates the way some of the different
classifiers model correlations among attributes and labels,
assuming a linear base classifier.

Fig. 1: A network view of various classifiers; the connections
among features and labels. Panels: (a) BR, (c) LP, (d) PW, CDN.
III. Deep Learning with Restricted Boltzmann
Machines

A well-known approach to deep learning is to model each 
layer of higher level features in a restricted Boltzmann machine 
0. We base our approaches on this strategy. 

A. Preliminaries

In all that follows, $\mathcal{X} \subset \mathbb{R}^d$ is the
input domain of all possible feature values. An instance is
represented as a vector of d feature values
$x = [x_1, \ldots, x_d]$. The set
$\mathcal{L} = \{\lambda_1, \ldots, \lambda_L\}$ is the output
domain of L possible labels. Each instance x is associated with
a subset of these labels $Y \subseteq \mathcal{L}$, typically
represented by a binary vector $y = [y_1, \ldots, y_L]$, where
$y_j = 1 \Leftrightarrow \lambda_j \in Y$; i.e., $y_j = 1$ if and
only if the jth label is associated with instance x, and 0
otherwise.

We assume a set of training data of N labelled examples
$\{(x_i, y_i)\}_{i=1}^{N}$; $y_i$ is the label vector (labelset)
assignment of the ith example; $y_j^{(i)}$ is the relevance of
the jth label to the ith example.

In the BR context, for example, L binary classifiers
$h_1, \ldots, h_L$ are trained, where each $h_j$ models the
binary problem relating to the jth label, such that

$$\hat{y} = h(x) = [h_1(x), \ldots, h_L(x)]$$

outputs prediction vector $\hat{y} \in \{0,1\}^L$ for any test
instance x.

B. Restricted Boltzmann Machines

A Boltzmann machine is a type of fully-connected neural
network that can be used to discover the underlying regularities
of the (observed) training data 0. When many features
are involved, this type of network is only tractable in the
restricted Boltzmann machine setting 0, where units are fully
connected between layers, but are unconnected within layers.

An RBM learns a layer of u hidden feature variables from
the original d feature variables of a training set (usually
$u < d$). These hidden variables can provide a compact
representation of the underlying patterns and structure of the
input. In fact, an RBM can capture $2^u$ input space regions,
whereas standard clustering requires $O(2^u)$ parameters and
examples to capture this much complexity.

Figure 2 shows an RBM as a graphical model with two
sets of nodes: visible (X-variables, shaded) and hidden
(Z-variables). Each $X_j$ is connected to all $Z_k$ by
weight $W_{jk}$ (the same for both directions).

Fig. 2: An RBM with 5 input units and 3 hidden units. Each
edge is associated with a weight $W_{jk}$, which together make
up weight matrix W.

RBMs are energy-based models, where the joint probability
of visible and hidden units is determined by the energy
between them:

$$P(x, z) \propto e^{-E(x, z)}$$

Hence, by manipulating the energy E we can in turn shape
the probability P(x, z). Specifically, we minimize the energy

$$E(x, z) = -x^{\top} W z$$

by learning the weight matrix W to find low energy states.
Contrastive divergence ID is typically used for this task.

C. Deep Belief Networks

RBMs can be stacked to form so-called DBNs 0. The
RBMs are trained greedily: the first RBM takes the input space
X and produces output $Z^{(1)}$, then the second RBM treats
$Z^{(1)}$ as if it were the input space, and produces $Z^{(2)}$,
and so on.

When used for single-label classification, the final output
layer is typically a softmax function (which is appropriate
where only one of the output units should be on, to indicate
one of K classes). In the following section we outline our
approach, creating DBNs suitable for multi-label classification.
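
To make the RBM described in this section more concrete, the following pure-Python sketch trains a small RBM with one-step contrastive divergence (CD-1) on binary data. It is a sketch under stated assumptions rather than the implementation used in our experiments: it uses the bias-free energy $E(x, z) = -x^{\top} W z$, omits momentum and weight costs, and the toy data, learning rate, epoch count and function names are all illustrative choices.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def hidden_probs(x, W):
    """P(z_k = 1 | x) for each hidden unit k; W has shape d x u."""
    return [sigmoid(sum(x[j] * W[j][k] for j in range(len(x))))
            for k in range(len(W[0]))]


def visible_probs(z, W):
    """P(x_j = 1 | z) for each visible unit j (same weights, other direction)."""
    return [sigmoid(sum(W[j][k] * z[k] for k in range(len(z))))
            for j in range(len(W))]


def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def energy(x, z, W):
    """Bias-free energy E(x, z) = -x^T W z."""
    return -sum(x[j] * W[j][k] * z[k]
                for j in range(len(x)) for k in range(len(z)))


def train_rbm_cd1(data, num_hidden, epochs=50, lr=0.1, seed=0):
    """One-step contrastive divergence (CD-1) on binary data."""
    rng = random.Random(seed)
    d = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(num_hidden)] for _ in range(d)]
    for _ in range(epochs):
        for x0 in data:
            p_h0 = hidden_probs(x0, W)              # positive phase
            h0 = sample(p_h0, rng)
            x1 = sample(visible_probs(h0, W), rng)  # one Gibbs step
            p_h1 = hidden_probs(x1, W)              # negative phase
            for j in range(d):
                for k in range(num_hidden):
                    W[j][k] += lr * (x0[j] * p_h0[k] - x1[j] * p_h1[k])
    return W


# Toy binary data (hypothetical) with 5 visible units, as in Fig. 2.
data = [[1, 1, 0, 0, 0], [1, 1, 1, 0, 0], [0, 0, 0, 1, 1], [0, 0, 1, 1, 1]]
W = train_rbm_cd1(data, num_hidden=3)
z = hidden_probs(data[0], W)  # compact 3-dimensional representation
```

The vector `z` is the hidden representation that later sections use as a replacement feature space.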


IV. Deep Belief Networks (DBNs) for Multi-label 
Classification 

Ideally, an RBM would produce hidden variables that
correspond directly to the label variables, and thus we could recover
the label vector directly given any input vector; i.e., $y = z^{(1)}$,
or deterministically mappable $z^{(1)} \to y$. Unfortunately, this is
seldom the case, because the abstract hidden variables do not
need to correspond directly to the labels. However, we should
expect the hidden layer of data to be more closely related to
the labels than the original data, and thus it makes sense to
use it as a feature space to classify instances.

Hence, by using the hidden space created by the RBM,
we would expect any multi-label classifier to obtain better
performance (than when using the original feature space).
We do this simply by using the hidden representation of each
instance as the input feature space, and associating it with the
labels to create training set $\{(z_i, y_i)\}_{i=1}^{N}$. We can then train
any multi-label classifier h on this dataset. To evaluate a test
instance x, we feed it through the RBM and obtain z from
the upper layer, and then acquire a prediction $\hat{y} = h(z)$;
we do likewise for each test instance.
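
The procedure just described can be sketched as follows. The weight matrix, the toy data and the stand-in classifier are hypothetical, and using the hidden-unit activation probabilities (rather than sampled binary states) as the new features is an assumption of this sketch.

```python
import math


def rbm_transform(x, W):
    """Hidden-unit activation probabilities, used as the new feature vector z."""
    return [1.0 / (1.0 + math.exp(-sum(xj * wj[k] for xj, wj in zip(x, W))))
            for k in range(len(W[0]))]


def build_hidden_training_set(X, Y, W):
    """Pair each hidden representation z_i with its original label vector y_i."""
    return [(rbm_transform(x, W), y) for x, y in zip(X, Y)]


def predict_via_rbm(h, x, W):
    """Feed a test instance through the RBM, then through classifier h."""
    return h(rbm_transform(x, W))


def threshold_classifier(z):
    """Stand-in for any multi-label classifier h: thresholds each feature."""
    return [1 if zk > 0.5 else 0 for zk in z]


# Hypothetical trained weights: d = 3 visible units, u = 2 hidden units.
W = [[2.0, -2.0], [2.0, -2.0], [-2.0, 2.0]]
X = [[1, 1, 0], [0, 0, 1]]
Y = [[1, 0], [0, 1]]

train = build_hidden_training_set(X, Y, W)
print(predict_via_rbm(threshold_classifier, [1, 1, 0], W))
```

In practice `threshold_classifier` would be replaced by a classifier trained on `train`, such as the ECC, RAkEL or pairwise methods evaluated later.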


From here we take two approaches. Since the sub-optimality
produced by greedy learning is not necessarily harmful to
many discriminative supervised methods HH, we can treat the
final hidden layer variables $Z^{(\ell)}$ as the feature input variables,
and train any off-the-shelf multi-label model h that can predict

$$\hat{y} = h(z^{(\ell)})$$

where $z^{(\ell)}$ is produced by the RBM for some test instance x;
see Figure 3a.

In a second approach, we add a final layer of weights
on top; see Figure 3b. Now the structure is similar
to the neural network of BPMLL ll29l, except that we create the
layers and initialize the weights using RBMs. Later we will
show that our method performs much better. We can employ
back propagation to fine-tune the network in a supervised
fashion (with respect to label assignments) as in, for example,
a (for single-label classification). For a number of epochs,
each training instance $x_i$ is propagated forward (upward)
through the network and output as the prediction $\hat{y}_i$. The
errors $\epsilon_i = y_i - \hat{y}_i$ are then propagated backward through
the network, updating the weights (previously initialized by
the RBMs). Due to the initialisation with RBMs, far fewer
epochs are required than would usually be typical for back
propagation (and we actually observed that more than around
100 epochs tends to result in overfitting).

In both of these approaches it is possible to add more depth
in the form of an additional classification layer. In
the multi-label context, this has previously been done to the
basic BR method in 0, where a second BR is trained on the
outputs of the first (a stacking approach). A related technique
in the neural network context, often called a "skip layer", has
been used in, e.g., lfT9ll . ifThll . In our case we allow for generic
classifiers. This helps add some further discriminative power
for taking into account the dependencies in the label space.




(b) A DBN where a 3rd hidden layer represents the labels. 


Fig. 3: DBNs for multi-label classification. In 3a the output
space (second hidden layer) can be trained with the label
space Y by any multi-label classifier. In 3b the labels are
predicted directly in a third hidden layer.


Note that we have also experimented with a DBN that
models the instance space and label space together
generatively, P(x, y, z). In the multi-label setting this complicates
the inference, since there are $2^L$ possible y. We tried using
Gibbs sampling, but could not obtain competitive results from
this model in the multi-label setting compared to our other
approaches (even after reducing x in an RBM first). However,
this seems like an interesting direction, and we intend to follow
this idea further in future work.


V. Experiments 

We carry out an empirical evaluation to gauge the
effectiveness and efficiency of RBMs and DBNs in a number of
different multi-label classification scenarios, using different
learning algorithms and a wide collection of databases. We
have implemented these methods in the MEKA framework,*
an open-source Java-based framework with a number of
important benchmark multi-label methods. In this framework RBMs
can easily be used in a wide variety of multi-label schemes.
The source code of our implementations will be made available
as part of the MEKA framework.

We selected commonly-used datasets from a variety of
domains, listed in Table I along with some basic statistics
about them. The datasets vary considerably with respect
to the type of data, and their dimensions (the number of
labels, features, and examples). In Music, instances of music
are associated with emotions; in Scene, images belong to
categories; in Yeast, proteins may be associated with multiple
biological functions, as may gene sequences in Genbase. Medical,
Enron and Reuters are text datasets where text documents
are associated with categories. These datasets are described
in greater detail in ca.


TABLE I: A collection of multi-label datasets and associated
statistics, where LC is label cardinality: the average number
of labels relevant to each example.

            N       L      d       LC      Type
Music       593     6      72      1.87    audio
Scene       2407    6      294     1.07    image
Yeast       2417    14     103     4.24    biology
Genbase     661     27     1185    1.25    biology
Medical     978     45     1449    1.25    medical/text
Enron       1702    53     1001    3.38    e-mail/text
Reuters     6000    103    500     1.46    news/text
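
The label cardinality (LC) statistic reported in Table I is simply the mean number of relevant labels per example. A minimal sketch, with a hypothetical toy label matrix:

```python
def label_cardinality(Y):
    """Average number of relevant labels per example."""
    return sum(sum(row) for row in Y) / len(Y)


# Toy label matrix (hypothetical): 4 examples, 3 labels.
Y = [[1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 1]]
print(label_cardinality(Y))
```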


A. RBM performance 

We first examine the effect of introducing an RBM,
trained blindly (i.e., without labels), to reduce the input
dimension, and then try out three of the common paradigms in
multi-label classification (namely BR, LP and PW) on the
extracted features. The RBM will improve the performance of
these multi-label classification paradigms if the extracted
features are relevant for describing the task at hand, and
will be neutral or harmful if the blindly extracted features
do not correspond to features relevant for assigning labels.

The RBM has several parameters that need to be tuned
(i.e., number of hidden units, learning rate and momentum)
and we use three-fold cross validation to set them. We
considered the number of hidden units $u \in \{30, 60, 120, 240\}$,
the learning rate $\eta \in \{0.1, 0.01, 0.001\}$, and momentum
$\alpha \in \{0.2, 0.4, 0.8\}$. We used weight costs of 2 • 10“^ and
i? = 1000 epochs throughout.
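
The search space above contains 4 x 3 x 3 = 36 configurations. The sketch below enumerates them and outlines a three-fold split; the helper names and the interleaved fold assignment are assumptions for illustration, and the scoring function (which would train an RBM and a classifier on each fold) is left to the caller.

```python
from itertools import product

hidden_units = [30, 60, 120, 240]
learning_rates = [0.1, 0.01, 0.001]
momenta = [0.2, 0.4, 0.8]

# Every (u, learning rate, momentum) combination considered.
grid = list(product(hidden_units, learning_rates, momenta))


def three_fold_indices(n):
    """Split range(n) into three (train, validation) index pairs."""
    folds = [list(range(i, n, 3)) for i in range(3)]
    return [(sorted(set(range(n)) - set(fold)), fold) for fold in folds]


def select_best(grid, score):
    """Return the configuration with the highest cross-validated score."""
    return max(grid, key=score)


print(len(grid))
```

Here `score` is a caller-supplied function mapping a configuration to its mean validation accuracy over the three folds.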

* http://meka.sourceforge.net 
1) Ensemble of Classifier Chains: CC is a competitive BR
method that uses the chain rule to improve the prediction for
each potential label. As it is unclear what the best
ordering should be, we use an ensemble of 50 CC, in which the labels
are randomly ordered in each realization (as in ifTSll '). In Table
IIa we report the accuracy, as defined in El, El, m, m,
as the performance measure for our multi-label classifier:²


$$\text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i \wedge \hat{y}_i|}{|y_i \vee \hat{y}_i|}$$

where $\wedge$ and $\vee$ are the bitwise AND and OR functions,
respectively, for $\{0,1\}^L \times \{0,1\}^L \to \{0,1\}^L$.
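
The accuracy measure can be computed directly from binary label vectors, as in the following sketch. The function name and toy vectors are illustrative, and the convention of scoring 1 for an example with no true and no predicted labels is an assumption, since the definition above leaves that case undefined.

```python
def multilabel_accuracy(Y_true, Y_pred):
    """Mean over examples of |y AND y_hat| / |y OR y_hat|."""
    total = 0.0
    for y, y_hat in zip(Y_true, Y_pred):
        intersection = sum(a & b for a, b in zip(y, y_hat))
        union = sum(a | b for a, b in zip(y, y_hat))
        # Assumption: an example with an empty union counts as a perfect match.
        total += intersection / union if union else 1.0
    return total / len(Y_true)


# Toy example (hypothetical): one partially correct and one exact prediction.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 1, 1], [0, 1, 0]]
print(multilabel_accuracy(Y_true, Y_pred))
```

In this example the first prediction scores 2/3 (two shared labels out of three in the union) and the second scores 1, giving a mean of 5/6.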


TABLE II: We compare ECC with and without feature
extraction using RBMs.

(a) We report the accuracy for SVM and logistic
regression based multi-label classifiers.

                  SVM              Log-Reg
            ECCr     ECC       ECCr     ECC
Music       0.581    0.576     0.558    0.504
Scene       0.731    0.710     0.709    0.554
Yeast       0.532    0.535     0.513    0.504
Genbase     0.979    0.981     0.971    0.977
Medical     0.695    0.770     0.449    0.706
Enron       0.469    0.454     0.451    0.355
Reuters     0.459    0.461     0.408    0.376


(b) The parameters chosen for ECCr on the first of the
two folds (using an internal train/test set of the training
set). Parameters for the second fold of each dataset were
invariably similar or identical.

                   SVMs                 Log. Reg.
            η       α      u        η       α      u
Music       0.1     0.2    120      0.1     0.8    30
Scene       0.1     0.8    240      0.1     0.8    60
Yeast       0.01    0.2    120      0.01    0.2    30
Genbase     0.1     0.8    120      0.1     0.4    60
Medical     0.1     0.6    120      0.1     0.6    120
Enron       0.1     0.6    120      0.1     0.6    120
Reuters     0.1     0.6    120      0.1     0.6    120


In Table IIa, ECCr and ECC denote the accuracy of the ECC
with the RBM-generated features and with the original input
space, respectively. We have used two different classifiers:
a nonlinear SVM and logistic regression (a linear classifier),
both trained with the default parameters in
WEKA. It can be seen that, for the logistic regression
classifier, the accuracy achieved with the features generated
by the RBM is significantly better for the Music, Scene,
Enron and Reuters datasets; it only underperforms for the Medical
dataset, and the two are comparable for the Yeast and Genbase
datasets. The RBM not only reduces the dimensionality of the
input space for the classifier, but it also makes the features
suitable for linear classifiers, which allows interpreting the RBM
features and understanding how each one of them participates in
the prediction for each label.

² There are a variety of multi-label evaluation measures used in multi-label
experiments in the literature; (22) provides an overview of some of the most
popular. The accuracy provides a good balance to gauge the overall predictive
performance of multi-label methods (m, Qa.

For the SVM-based ECC classifiers there is no significant
difference between using the RBM-processed features and
using the raw data directly, as the RBF kernel in the SVM
can compensate for the preprocessing done by the RBM. In
this case, almost all the results are comparable, except for
Scene and Medical, on which ECCr and ECC, respectively,
perform better. We should remark that the linear logistic
regression with RBM features is as good as the nonlinear SVM
in most cases, so it seems that using the RBM features reduces the input
dimension and makes the classification problem easier, as a
linear classifier performs as well as a state-of-the-art nonlinear
classifier.

In Figure 4 we show the accuracy on the seven datasets
for the ECC and ECCr multi-label classifiers with an SVM
base classifier, as a function of the number of hidden units of the
RBM. In this plot, it can be seen that once we have enough
features, using the RBM is comparable to not using it. It is
also clear that for Medical the number of features is too
small, and we would have needed to increase the number of
extracted features³ to achieve the same performance as the
SVM does on the original features.



Fig. 4: The number of hidden units (horizontal axis) and
corresponding accuracy as compared to accuracy with the
same methods on the original feature space (horizontal lines).
For $\eta = 0.1$, $\alpha = 0.1$.


Finally, in Table III we show the accuracy for the SVM-based
classifier on the Scene dataset for all the tested
combinations of learning rate and momentum, with the
number of hidden units fixed to 120. The accuracy for the
ECC (without RBM-generated features) is 0.695 and in this
case any combination of learning rate and momentum does
better, which indicates that with a sufficient number of hidden
units, the RBM learning is quite robust and not overly sensitive
to hyperparameter settings.

TABLE III: The accuracy of ECC, with an SVM base
classifier, for fixed number of hidden units u = 120, and for
varying learning rate (λ) and momentum (α).

λ        α      accuracy
0.001    0.2    0.707
0.001    0.4    0.705
0.001    0.8    0.705
0.01     0.2    0.710
0.01     0.4    0.714
0.01     0.8    0.720
0.1      0.2    0.726
0.1      0.4    0.727
0.1      0.8    0.726

Fig. 5: The difference in accuracy (shown here on Music
and Medical datasets) between baseline BR (dashed lines) and
more-advanced CC (solid lines), both built on RBM-produced
outputs, decreases with more hidden units (horizontal axis).
For $\eta = 0.1$, $\alpha = 0.1$.

³ We did not do so, to keep the experimental setting uniform for all proposed
methods, as we think it is important that hyper-parameter setting should be
general and not finely tuned for each application.

2) RAndom K labEL subsets: RAkEL is a truncated power
set method in which each model learns all label combinations
of a subset of k = 3 labels; we report an ensemble of 2L such
classifiers. We use the same
hyperparameter setting as we did for the ECC (as reported in
Table IIb) to make the results comparable across multi-label
classification paradigms, and we report the accuracy in Table IV.

TABLE IV: We report the accuracy for RAkEL with and
without feature extraction using RBMs, using SVM and
logistic regression based multi-label classifiers.

                  SVM              Log-Reg
            RAkR     RAk       RAkR     RAk
Music       0.581    0.579     0.538    0.465
Scene       0.712    0.684     0.663    0.469
Yeast       0.537    0.537     0.497    DNF
Genbase     0.984    0.984     0.968    0.976
Medical     0.652    0.743     0.494    0.639
Enron       0.452    0.413     0.376    0.273
Reuters     0.342    0.337     0.285    DNF

The results for this paradigm are similar to the ones that we
reported for the ECC in the previous section. For logistic
regression (a linear classifier), the RBM-generated features lend
themselves to more accurate predictions than the
unprocessed features with the same base classifier, and they
are comparable to the results achieved with the nonlinear SVM
classifier. After processing the features with an RBM we might
not need to rely on a nonlinear classifier. For the SVM, using
the RBM-generated features does not help, but it does not hurt
either, in terms of accuracy, as the SVM nonlinear mapping
is versatile enough to learn any nonlinear mapping.

3) Pairwise Classification: We implemented a pairwise
approach, namely the Four-class pairWise classifier (FW), in
which we build models to learn classes $y_{jk} \in \{00, 01, 10, 11\}$
for each label pair $1 \le j < k \le L$, dividing each prediction into votes
for the individual labels $y_j$ and $y_k$ and using a threshold
at classification time. We find that overall it obtains better
predictive performance than the pairwise methods that create
decision boundaries between labels (where $y_{jk} \in \{01, 10\}$),
as in 0 , for example, especially with SVMs. We report the
accuracy in Table V, using the same hyperparameters as
we did for the ECC (as reported in Table IIb) to make the
results comparable across multi-label classification paradigms.

TABLE V: We report the accuracy for FW with and without
feature extraction using RBMs, using SVM and logistic
regression based multi-label classifiers.

                  SVM              Log-Reg
            FWr      FW        FWr      FW
Music       0.578    0.573     0.549    0.492
Scene       0.694    0.649     0.660    0.490
Yeast       0.537    0.538     0.507    0.495
Genbase     0.985    0.985     0.949    0.975
Medical     0.571    0.748     0.492    DNF
Enron       0.463    0.408     0.376    DNF

The conclusions are similar to the other two paradigms.
The linear classifier (logistic regression) does significantly
better with the RBM-generated features than with the original
input space, while the nonlinear SVM classifier is versatile
enough to provide accurate predictions with or without
RBM-generated features. Fortunately, the linear classifier with
RBM-generated features is quite close to the SVM-based classifier
and allows us to interpret which RBM features contribute to each
label; hence we can provide intuitive interpretations for each
RBM feature, while it is hard to get such an interpretation from
the SVM nonlinear mapping.
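
The vote aggregation used by the four-class pairwise (FW) scheme in this section can be sketched as follows. This is an illustration only: the function name, the hypothetical pair predictions, the default threshold of 0.5 and the strict comparison against it are all assumptions, not details taken from our implementation.

```python
from itertools import combinations


def aggregate_pairwise_votes(pair_predictions, num_labels, threshold=0.5):
    """Turn four-class pair predictions into a multi-label prediction.

    pair_predictions maps (j, k) with j < k to a pair of bits (b_j, b_k),
    i.e. one of the classes 00, 01, 10, 11.
    """
    votes = [0] * num_labels
    for (j, k), (b_j, b_k) in pair_predictions.items():
        votes[j] += b_j
        votes[k] += b_k
    # Each label takes part in num_labels - 1 pairs.
    return [1 if v / (num_labels - 1) > threshold else 0 for v in votes]


# Hypothetical outputs of the three pairwise models for L = 3 labels.
pair_predictions = {(0, 1): (1, 0), (0, 2): (1, 1), (1, 2): (0, 1)}
assert set(pair_predictions) == set(combinations(range(3), 2))
print(aggregate_pairwise_votes(pair_predictions, num_labels=3))
```

With L labels there are L(L-1)/2 pairwise models, which is the source of the quadratic cost of pairwise methods.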


B. DBN performance 

After analyzing the performance of the RBM-generated
features, we focus on two DBN structures for multi-label
classification:

• DBN_ECC: a network of two hidden layers, the final of
which is united with the labels in a new dataset and
trained with ECC (see Figure 3a).

• DBN_BP: a network of three hidden layers where the
final layer represents the labels; fine-tuned with back
propagation (see Figure 3b).

Both setups can be visualised in Figure 6, where h = in
the case of DBN_BP.

We use u = d/b hidden units, 1000 RBM epochs, 100 BP
epochs (on DBN_BP), and the best of either $\alpha = 0.8$, $\lambda = 0.1$
and $\alpha = 0.8$, $\lambda = 0.1$ on a 67:33 percent internal train/test
validation (taking advantage of the fact, as we explained
earlier, that the choice of learning rate and momentum is fairly
robust given enough hidden units).
Fig. 6: A deep learning setup for multi-label classification.

In Table VI we compare the accuracy of the proposed
DBN structures and the previously discussed methods. We
have also added MLkNN, BPMLL, and IBLR (see Section
II for details). In this table we can see that DBN_ECC is
either the best classifier or close to the best, which suggests
that the features generated by the second layer improve on
those of the first layer. For example, on the only dataset (Medical) on
which ECCr was not good enough compared to ECC,
DBN_ECC and DBN_BP now do almost as well as ECC, and
the performance on the other datasets is also improved
(or not degraded). This structure seems to be amenable
to multi-label classification and competitive with all the
proposed paradigms in the literature.


VI. Conclusions 

Our empirical evaluation over a variety of multi-label 
datasets shows that a selection of high-performing multi-label 
methods from the literature can be improved upon by using 
an RBM-processed feature space. The labels become easier
to model at training time, and to predict at inference time. We
obtained an improvement of up to 15 percentage points in
accuracy over using the original feature space directly. Our
study showed that important improvements can be obtained in
multi-label classification with respect to both scalability and
predictive performance when using deep learning.
As a result, we recommend that multi-label practitioners
focus more on feature modelling, rather than
solely on modelling dependencies between the output labels.
Our multi-label DBN models achieved the best predictive
performance overall compared with seven competing methods
from the multi-label literature.
TABLE VI: Comparing multi-label methods under accuracy. Highest results are set in boldface.

            DBN_BP   DBN_ECC  MLkNN    IBLR     ECCr     ECC      RAk      FW       BPMLL
Music       0.577    0.581    0.542    0.545    0.581    0.576    0.579    0.573    0.533
Scene       0.731    0.742    0.696    0.697    0.731    0.710    0.684    0.649    0.552
Yeast       0.529    0.531    0.537    0.539    0.532    0.535    0.537    0.538    0.491
Genbase     0.984    0.985    0.950    0.918    0.979    0.981    0.984    0.985    0.049
Medical     [0.742, 0.596; remaining cells lost]         0.770    0.743    0.748    0.053
Enron       0.442    0.480    0.353    0.363    0.469    0.454    0.413    0.408    0.144
Reuters     0.410    0.451    0.408    0.357    0.459    0.461    0.337    DNF      0.004